Natural Language Engineering Techniques
Lecture 1
Levels of processing applied to natural language
Building resources, initial document processing, and the sub-syntactic level
Lectures: Dan Cristea
Labs: Diana Trandabăț, Mihaela Onofrei, Daniela Gîfu, Ionuț Pistol

How far can the machine go in understanding language?
Natural language technologies test the capacity to use natural language in applications
• The ultimate test: "understanding" texts => the capacity to react correctly to the message encoded in the text
  – levels of analysis:
    • lexical
    • morphological
    • syntactic
    • semantic
    • discourse
    • pragmatic

How to extract the content of texts?
• Content (semantics) = the objective knowledge, i.e. what a large community of humans would identify in the same way
• Understanding language puts to work a diversity of linguistic knowledge (innate, acquired); see the Piaget ↔ Chomsky debate: innate ↔ acquired
  – phonological, morphological, lexical
  – syntactic
  – semantic
  – discourse
  – pragmatic
• All these layers must be reproduced on the machine

However, humans process texts differently…
• We are not conscious of all these layers of processing
• We can easily recognise erroneous messages, skipping over their errors (of morphology, syntax, etc.)
  – therefore, our way of treating language is more association-based than rule-based
  – we integrate and combine different sources of knowledge when taking language decisions
  – when educated, we recognise errors but we are not misled by them

Modules like this one can be organised in chains
• Language independent module
• Language specific resources: corpora, treebanks, wordnets, verbnets, language models (neural, statistical), …
• APPROACHES: symbolic, statistical, neural

Creation of linguistic resources
Language specific resources: corpora, treebanks, wordnets, verbnets, language models (neural, statistical), …

How are resources obtained? Step 1: extracting human expertise
text => annotated text

How are resources obtained? Step 2: synthesising the models
learning/mining/training program => set of rules/correlations/network

Example: a syntactic parser (a program able to extract the syntactic tree of a sentence)
• Parser: language independent software
• Set of symbolic rules/correlations/network for language L

How are resources obtained? Step 3: evaluation
text => syntactic parser + set of symbolic rules/correlations/network for language L
comparison => evaluation

A language processing pipeline
text => INITIAL PROCESSING => SUB-SYNTACTIC PROCESSING => SYNTACTIC PROCESSING => SEMANTIC PROCESSING => DISCOURSE PROCESSING => PRAGMATIC PROCESSING => result

The document layer: processing old texts
text/image => INITIAL PROCESSING => SUB-SYNTACTIC PROCESSING => SYNTACTIC PROCESSING => SEMANTIC PROCESSING => DISCOURSE PROCESSING => PRAGMATIC PROCESSING => result
Initial processing of an image: image => IMAGE SEGMENTATION => OCR => INTERPRETATIVE TRANSCRIPTION

CyRo – build a technology that interprets old Cyrillic Romanian
• Train OCR classifiers to decode printed, semi-uncial and cursive Cyrillic Romanian documents
• Ambitious goals of a mixed consortium
  – library curators
  – paleolinguists
  – image processing experts
  – computational linguists
PhD student Cristian Pădurariu

Problems:
- stains
- damage
- misalignments
- distortions
- non-uniform character set
- diacritics
- writing between the lines or in the margin
- etc.

The sub-syntactic layer
text => INITIAL PROCESSING => SUB-SYNTACTIC PROCESSING => SYNTACTIC PROCESSING => SEMANTIC PROCESSING => DISCOURSE PROCESSING => PRAGMATIC PROCESSING => result
Sub-syntactic processing: SENTENCE BORDERS => TOKENIZATION => POS-TAGGING => RECOGNIZE LEMMAS => NP CHUNKING
Sometimes several such steps are short-circuited in the natural brain

Example of technologies: Google Translate
• Example based translation

Morphological tagging
Dr. Radu Simionescu
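
The sub-syntactic steps listed above (sentence borders, tokenization, POS-tagging, lemma recognition, NP chunking) can be illustrated as a chain of language independent modules driven by a language specific resource. The sketch below is only a toy under stated assumptions: the lexicon `LEXICON_EN`, the tag names, the function names and the `DET? ADJ* NOUN+` chunk pattern are all invented for illustration and do not come from the course. The dictionary lookup stands in for a real tagger, which could equally be statistical or neural.

```python
import re

# Language specific resource (toy, hypothetical): word form -> (POS tag, lemma).
LEXICON_EN = {
    "the": ("DET", "the"), "a": ("DET", "a"),
    "old": ("ADJ", "old"),
    "books": ("NOUN", "book"), "parser": ("NOUN", "parser"),
    "were": ("VERB", "be"), "scanned": ("VERB", "scan"),
    "reads": ("VERB", "read"), "them": ("PRON", "they"),
}

def split_sentences(text):
    """Sentence borders: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Tokenization: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag_and_lemmatize(tokens, lexicon):
    """POS-tagging and lemma recognition by plain lexicon lookup."""
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"\w+", tok):
            # Unknown words get the tag UNK and their lowercase form as lemma.
            pos, lemma = lexicon.get(tok.lower(), ("UNK", tok.lower()))
        else:
            pos, lemma = "PUNCT", tok
        tagged.append((tok, pos, lemma))
    return tagged

def chunk_nps(tagged):
    """NP chunking: collect sequences matching DET? ADJ* NOUN+."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found, so this is a chunk
            chunks.append(" ".join(tok for tok, _, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

def run_pipeline(text, lexicon):
    """Chain the modules; only `lexicon` depends on the language."""
    results = []
    for sentence in split_sentences(text):
        tagged = tag_and_lemmatize(tokenize(sentence), lexicon)
        results.append({"tokens": tagged, "noun_phrases": chunk_nps(tagged)})
    return results

if __name__ == "__main__":
    for analysis in run_pipeline(
        "The old books were scanned. A parser reads them.", LEXICON_EN
    ):
        print(analysis["tokens"])
        print(analysis["noun_phrases"])
```

Replacing `LEXICON_EN` with a lexicon for another language would leave the functions untouched, which is the separation between language independent module and language specific resources that the slides describe.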